By: Oswaldo Rodriguez and Hayu Mohammed
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.0     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
##
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
##
## Attaching package: 'arules'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(ggsignif)
library(rstatix)
##
## Attaching package: 'rstatix'
##
## The following object is masked from 'package:stats':
##
## filter
library(ggplot2)
library(knitr)
The NBA generates a vast amount of data, and teams increasingly apply machine learning to it. Player statistics are one of the largest parts of that data. Teams benefit from figures such as field goal percentage and win percentage, and player stats like points per game and two-point percentage help evaluate players across the league. The same data also informs awards such as Most Valuable Player. In our research, we want to know which statistics make a player stand out. We use clustering to examine how NBA players group together based on their statistics.
Our data, gathered from Kaggle, is well suited to clustering. It contains 30 variables that describe players and how they rank in the league, including age, points per game, field goal percentage, and many more. These variables are the focal points for understanding which players perform best and how their statistics relate. We use this data to create clusters and better understand the players at each position.
players <- read_delim("2022-2023 NBA Player Stats - Regular1.csv",
delim = ";", escape_double = FALSE, trim_ws = TRUE)
## Rows: 679 Columns: 30
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr (3): Player, Pos, Tm
## dbl (27): Rk, Age, G, GS, MP, FG, FGA, FG%, 3P, 3PA, 3P%, 2P, 2PA, 2P%, eFG%...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Let's first look at which variables are correlated with each other.
player1 <- subset(players, select = -c(Rk, Player, Pos, Age, Tm))
cor(player1[,1:25])
## G GS MP FG FGA FG%
## G 1.0000000 0.65530291 0.6062581 0.4907191 0.48573129 0.18267369
## GS 0.6553029 1.00000000 0.8014897 0.7346554 0.72153121 0.16134855
## MP 0.6062581 0.80148974 1.0000000 0.8757879 0.88290073 0.15323425
## FG 0.4907191 0.73465540 0.8757879 1.0000000 0.97568434 0.23766110
## FGA 0.4857313 0.72153121 0.8829007 0.9756843 1.00000000 0.08708725
## FG% 0.1826737 0.16134855 0.1532342 0.2376611 0.08708725 1.00000000
## 3P 0.3565542 0.48029225 0.6849064 0.6590881 0.74746369 -0.11101999
## 3PA 0.3661949 0.49669704 0.6995480 0.6680963 0.77637344 -0.17981189
## 3P% 0.1272412 0.05700937 0.2029705 0.1825614 0.20630770 0.06623292
## 2P 0.4465517 0.69258697 0.7746405 0.9404527 0.87065801 0.34405921
## 2PA 0.4497262 0.69901364 0.7951085 0.9475333 0.91121410 0.24080252
## 2P% 0.1758294 0.11314956 0.1368069 0.1895788 0.08242423 0.80375211
## eFG% 0.1960654 0.14541555 0.1860777 0.2247070 0.10473943 0.92405910
## FT 0.3944579 0.62827193 0.7172438 0.8659390 0.84384966 0.15989592
## FTA 0.3981964 0.63231384 0.7138577 0.8630177 0.82899386 0.20355505
## FT% 0.3688532 0.23588424 0.3734949 0.3224097 0.34748000 0.05477713
## ORB 0.3123659 0.40425053 0.3826654 0.3537043 0.23729016 0.43922781
## DRB 0.4811881 0.66023042 0.7418288 0.7362039 0.66393905 0.32633322
## TRB 0.4627653 0.62623360 0.6798334 0.6664053 0.57465499 0.38674025
## AST 0.3392265 0.56060334 0.7344346 0.7096414 0.73957527 0.02615284
## STL 0.3808913 0.54634471 0.7196038 0.5670234 0.58759064 0.02153608
## BLK 0.2993322 0.40265802 0.3901352 0.3865036 0.28887006 0.34110632
## TOV 0.4044454 0.63234885 0.7810125 0.8411052 0.83944898 0.13493412
## PF 0.4843352 0.59874741 0.7398938 0.6097673 0.57288484 0.27814131
## PTS 0.4848691 0.72662414 0.8733188 0.9917998 0.98102183 0.19099928
## 3P 3PA 3P% 2P 2PA 2P%
## G 0.35655422 0.36619486 0.127241231 0.44655175 0.44972619 0.175829361
## GS 0.48029225 0.49669704 0.057009372 0.69258697 0.69901364 0.113149559
## MP 0.68490639 0.69954802 0.202970501 0.77464053 0.79510850 0.136806871
## FG 0.65908814 0.66809631 0.182561399 0.94045268 0.94753330 0.189578801
## FGA 0.74746369 0.77637344 0.206307699 0.87065801 0.91121410 0.082424230
## FG% -0.11101999 -0.17981189 0.066232920 0.34405921 0.24080252 0.803752112
## 3P 1.00000000 0.97994595 0.429880301 0.36481293 0.42000758 -0.018284401
## 3PA 0.97994595 1.00000000 0.358647059 0.38462869 0.44796113 -0.038466275
## 3P% 0.42988030 0.35864706 1.000000000 0.03114739 0.05920008 -0.001236038
## 2P 0.36481293 0.38462869 0.031147390 1.00000000 0.98376621 0.242587214
## 2PA 0.42000758 0.44796113 0.059200084 0.98376621 1.00000000 0.141859692
## 2P% -0.01828440 -0.03846627 -0.001236038 0.24258721 0.14185969 1.000000000
## eFG% 0.10797617 0.01696363 0.312183488 0.22910320 0.13751704 0.755808713
## FT 0.47385741 0.49884889 0.108517505 0.85837695 0.87115071 0.104849173
## FTA 0.41937071 0.44596503 0.065730218 0.87947158 0.88456150 0.129232258
## FT% 0.36784346 0.37128520 0.262119583 0.23325103 0.25067026 0.172412080
## ORB -0.13830594 -0.14274861 -0.227050335 0.50009836 0.42933351 0.296386980
## DRB 0.30302566 0.31521027 0.011022772 0.77458388 0.73572077 0.245830110
## TRB 0.18258080 0.19041776 -0.064661726 0.74247801 0.69046677 0.280318049
## AST 0.53259871 0.55615519 0.157067385 0.63770622 0.68583757 -0.002678873
## STL 0.45062772 0.46451050 0.148819964 0.49827468 0.52993540 -0.005901233
## BLK -0.01690208 -0.02163976 -0.116342594 0.48657804 0.42377072 0.247837093
## TOV 0.52188190 0.54679734 0.119565031 0.80547744 0.83364621 0.071682295
## PF 0.33714467 0.34738106 0.065744071 0.60285355 0.58581371 0.231319451
## PTS 0.70513035 0.71467236 0.209946018 0.90965175 0.92467156 0.156070292
## eFG% FT FTA FT% ORB DRB
## G 0.19606543 0.3944579 0.39819645 0.36885316 0.31236587 0.48118815
## GS 0.14541555 0.6282719 0.63231384 0.23588424 0.40425053 0.66023042
## MP 0.18607773 0.7172438 0.71385774 0.37349489 0.38266540 0.74182879
## FG 0.22470699 0.8659390 0.86301772 0.32240968 0.35370432 0.73620387
## FGA 0.10473943 0.8438497 0.82899386 0.34748000 0.23729016 0.66393905
## FG% 0.92405910 0.1598959 0.20355505 0.05477713 0.43922781 0.32633322
## 3P 0.10797617 0.4738574 0.41937071 0.36784346 -0.13830594 0.30302566
## 3PA 0.01696363 0.4988489 0.44596503 0.37128520 -0.14274861 0.31521027
## 3P% 0.31218349 0.1085175 0.06573022 0.26211958 -0.22705033 0.01102277
## 2P 0.22910320 0.8583770 0.87947158 0.23325103 0.50009836 0.77458388
## 2PA 0.13751704 0.8711507 0.88456150 0.25067026 0.42933351 0.73572077
## 2P% 0.75580871 0.1048492 0.12923226 0.17241208 0.29638698 0.24583011
## eFG% 1.00000000 0.1121539 0.13472670 0.13878676 0.27489821 0.25593875
## FT 0.11215393 1.0000000 0.98530183 0.32686610 0.27857659 0.63185191
## FTA 0.13472670 0.9853018 1.00000000 0.26883859 0.35789126 0.67035689
## FT% 0.13878676 0.3268661 0.26883859 1.00000000 0.01346378 0.21122019
## ORB 0.27489821 0.2785766 0.35789126 0.01346378 1.00000000 0.68675117
## DRB 0.25593875 0.6318519 0.67035689 0.21122019 0.68675117 1.00000000
## TRB 0.28075680 0.5638973 0.61856750 0.16245513 0.83714192 0.97219368
## AST 0.03344974 0.6510405 0.63636213 0.25570716 0.10293013 0.49230072
## STL 0.04206830 0.4750819 0.46512198 0.23060554 0.20498827 0.44583586
## BLK 0.22550838 0.3376281 0.38229291 0.07557093 0.61871038 0.59323142
## TOV 0.09778705 0.7833097 0.78812286 0.25117350 0.29056375 0.66712629
## PF 0.23970794 0.4870333 0.51003940 0.23313431 0.56436232 0.70698333
## PTS 0.19938038 0.9019242 0.88969673 0.34999349 0.29672605 0.70553568
## TRB AST STL BLK TOV PF
## G 0.46276534 0.339226470 0.380891300 0.29933223 0.40444535 0.48433515
## GS 0.62623360 0.560603338 0.546344708 0.40265802 0.63234885 0.59874741
## MP 0.67983342 0.734434610 0.719603805 0.39013519 0.78101246 0.73989380
## FG 0.66640530 0.709641431 0.567023389 0.38650365 0.84110518 0.60976730
## FGA 0.57465499 0.739575265 0.587590640 0.28887006 0.83944898 0.57288484
## FG% 0.38674025 0.026152844 0.021536083 0.34110632 0.13493412 0.27814131
## 3P 0.18258080 0.532598710 0.450627719 -0.01690208 0.52188190 0.33714467
## 3PA 0.19041776 0.556155189 0.464510497 -0.02163976 0.54679734 0.34738106
## 3P% -0.06466173 0.157067385 0.148819964 -0.11634259 0.11956503 0.06574407
## 2P 0.74247801 0.637706219 0.498274676 0.48657804 0.80547744 0.60285355
## 2PA 0.69046677 0.685837567 0.529935403 0.42377072 0.83364621 0.58581371
## 2P% 0.28031805 -0.002678873 -0.005901233 0.24783709 0.07168229 0.23131945
## eFG% 0.28075680 0.033449737 0.042068295 0.22550838 0.09778705 0.23970794
## FT 0.56389728 0.651040524 0.475081858 0.33762806 0.78330974 0.48703328
## FTA 0.61856750 0.636362126 0.465121985 0.38229291 0.78812286 0.51003940
## FT% 0.16245513 0.255707156 0.230605537 0.07557093 0.25117350 0.23313431
## ORB 0.83714192 0.102930131 0.204988275 0.61871038 0.29056375 0.56436232
## DRB 0.97219368 0.492300721 0.445835857 0.59323142 0.66712629 0.70698333
## TRB 1.00000000 0.402766539 0.400147328 0.64471315 0.59440854 0.71231629
## AST 0.40276654 1.000000000 0.637376584 0.11621712 0.82760847 0.46677862
## STL 0.40014733 0.637376584 1.000000000 0.21286795 0.55924715 0.54092298
## BLK 0.64471315 0.116217117 0.212867952 1.00000000 0.29934738 0.52216924
## TOV 0.59440854 0.827608473 0.559247154 0.29934738 1.00000000 0.62155352
## PF 0.71231629 0.466778624 0.540922977 0.52216924 0.62155352 1.00000000
## PTS 0.62500961 0.720075529 0.568756007 0.34946448 0.84240155 0.58718657
## PTS
## G 0.4848691
## GS 0.7266241
## MP 0.8733188
## FG 0.9917998
## FGA 0.9810218
## FG% 0.1909993
## 3P 0.7051303
## 3PA 0.7146724
## 3P% 0.2099460
## 2P 0.9096518
## 2PA 0.9246716
## 2P% 0.1560703
## eFG% 0.1993804
## FT 0.9019242
## FTA 0.8896967
## FT% 0.3499935
## ORB 0.2967261
## DRB 0.7055357
## TRB 0.6250096
## AST 0.7200755
## STL 0.5687560
## BLK 0.3494645
## TOV 0.8424015
## PF 0.5871866
## PTS 1.0000000
PTS is moderately to highly correlated (r above roughly 0.55) with GS, MP, FG, FGA, 3P, 3PA, 2P, 2PA, FT, FTA, DRB, TRB, AST, STL, TOV, and PF. The strongest relationships are with FG (0.99) and FGA (0.98).
First, we prepare the data and look for the optimal number of clusters. Clustering is an unsupervised learning technique: it groups players by similarity without using predefined labels. Here we keep only two statistics, points (PTS) and assists (AST), standardize them with scale(), and use the elbow (wss), silhouette, and gap statistic methods to choose the number of clusters.
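To read the PTS column without scanning the whole matrix, one could pull it out of the correlation matrix and sort it. The sketch below is only an illustration: `toy` is a small made-up data frame standing in for `player1`, and the 0.55 cutoff is an assumed threshold, not one taken from the analysis.

```r
# Made-up values standing in for player1 (illustration only).
toy <- data.frame(
  PTS = c(30.1, 25.4, 12.3, 8.7, 4.2, 18.9),
  AST = c(8.1, 6.0, 3.2, 1.5, 0.8, 4.4),
  BLK = c(0.5, 0.9, 1.4, 0.3, 0.2, 0.6)
)

cor_mat <- cor(toy)

# Correlations of every other variable with PTS, strongest first.
pts_cor <- sort(cor_mat[, "PTS"], decreasing = TRUE)
pts_cor <- pts_cor[names(pts_cor) != "PTS"]

# Keep only variables above the assumed cutoff.
strong <- pts_cor[abs(pts_cor) > 0.55]
print(round(strong, 2))
```

With the real data, replacing `toy` with `player1` should give the same ranking that the matrix above shows for the PTS column.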
players_scaled = players %>%
  select(AST, PTS) %>%  # keep only assists and points
  scale()
fviz_nbclust(players_scaled, kmeans, method = "wss")
fviz_nbclust(players_scaled, kmeans, method = "silhouette")
fviz_nbclust(players_scaled, kmeans, method = "gap_stat")
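The elbow ("wss") plot is based on the total within-cluster sum of squares for each candidate number of clusters. As a rough sketch of that idea in base R, the block below computes the curve by hand. The data is simulated (`toy` is a stand-in for `players_scaled`, not the real player data), and the range of k values is an arbitrary choice.

```r
# Simulated two-column data standing in for players_scaled (values are made up).
set.seed(1)
toy <- scale(rbind(
  matrix(rnorm(60, mean = 0), ncol = 2),
  matrix(rnorm(60, mean = 5), ncol = 2),
  matrix(rnorm(60, mean = 10), ncol = 2)
))
colnames(toy) <- c("AST", "PTS")

# Total within-cluster sum of squares for k = 1..6.
ks <- 1:6
wss <- sapply(ks, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# The "elbow" is the k after which adding clusters stops reducing wss sharply.
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

Because the simulated data has three well-separated groups, the curve should flatten after k = 3; real player data usually gives a less obvious bend, which is why the silhouette and gap statistic plots are useful as a cross-check.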
set.seed(1)
cluster = kmeans(players_scaled,
centers = 3,
nstart = 25)
fviz_cluster(
cluster,
data = players_scaled,
main = "Players Segmented by Points and Assists",
repel = TRUE
)
cluster$size
## [1] 203 84 392
This plot groups players by two of the most closely tracked statistics in the league. Points and assists are two of the most important measures of a player's scoring and playmaking. Most players fall in the blue cluster, which has the lowest points and assists. Great, efficient players such as Ja Morant, LeBron James, and Kevin Durant all fall in the green cluster, which contains the players who are both high scoring and strong at playmaking. This view is useful for MVP and other award voters who need to compare players.
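One way to confirm which cluster is the high-scoring, high-assist group is to average the original (unscaled) statistics within each cluster. The sketch below shows the idea on simulated data; `toy` and its values are invented for illustration and are not the player data.

```r
# Simulated points and assists for three made-up groups of players.
set.seed(1)
toy <- data.frame(
  PTS = c(rnorm(30, 5, 1.5), rnorm(30, 14, 2), rnorm(30, 26, 2.5)),
  AST = c(rnorm(30, 1, 0.4), rnorm(30, 3, 0.7), rnorm(30, 7, 1))
)

# Cluster on the standardized columns, as in the analysis.
fit <- kmeans(scale(toy), centers = 3, nstart = 25)
toy$cluster <- fit$cluster

# Mean points and assists per cluster, in the original units.
profile <- aggregate(cbind(PTS, AST) ~ cluster, data = toy, FUN = mean)
profile$n <- as.vector(table(toy$cluster))

# Order clusters from lowest to highest scoring for easier reading.
print(profile[order(profile$PTS), ])
```

Applied to the real data, the same pattern would use `players` with `cluster$cluster` attached as a column, which makes it easier to describe each colored region in basketball terms.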
players_scaled1 = players %>%
  select(STL, BLK) %>%  # keep only steals and blocks
  scale()
fviz_nbclust(players_scaled1, kmeans, method = "wss")
fviz_nbclust(players_scaled1, kmeans, method = "silhouette")
fviz_nbclust(players_scaled1, kmeans, method = "gap_stat")
set.seed(1)
cluster1 = kmeans(players_scaled1,
centers = 3,
nstart = 25)
fviz_cluster(
cluster1,
data = players_scaled1,
main = "Players Segmented by Steals and Blocks",
repel = TRUE
)
When evaluating players, defense is often the overlooked part of the game. The data shows a wide disparity between blocks and steals: players with many blocks tend not to have many steals, and across all players the correlation between STL and BLK is only about 0.21. Centers and power forwards, who are typically taller and stronger, find it easier to block shots, while steals depend more on agility.
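The positional pattern described above could be checked by averaging steals and blocks for each position. The sketch below shows the approach on a tiny invented data frame; the numbers in `toy` are made up, and only the column names (`Pos`, `STL`, `BLK`) follow the data set.

```r
# Invented per-game values for two positions (illustration only).
toy <- data.frame(
  Pos = rep(c("PG", "C"), each = 4),
  STL = c(1.6, 1.2, 0.9, 1.4, 0.5, 0.7, 0.4, 0.6),
  BLK = c(0.2, 0.3, 0.1, 0.4, 1.8, 1.1, 2.0, 1.5)
)

# Mean steals and blocks per position.
by_pos <- aggregate(cbind(STL, BLK) ~ Pos, data = toy, FUN = mean)
print(by_pos)
```

Running the same `aggregate()` call on `players` would show whether centers and power forwards really average more blocks and fewer steals than guards.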
Now we will filter the positions and analyze the clusters. Here we are taking a look at point guards.
playersg <- filter(players, Pos == "PG")
players_scaled_g = playersg %>%
  select(AST, PTS) %>%  # keep only assists and points
  scale()
fviz_nbclust(players_scaled_g, kmeans, method = "wss")
fviz_nbclust(players_scaled_g, kmeans, method = "silhouette")
fviz_nbclust(players_scaled_g, kmeans, method = "gap_stat")
set.seed(1)
clusterg = kmeans(players_scaled_g,
centers = 3,
nstart = 25)
fviz_cluster(
clusterg,
data = players_scaled_g,
main = "Point Guards Segmented by Points and Assists",
repel = TRUE
)
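Here we add each point guard's cluster assignment as a new column so that we can plot the clusters with ggplot2.
playersg$cluster <- as.integer(clusterg$cluster)
ggplot(data = playersg,
       aes(x = scale(PTS), y = scale(AST), color = factor(cluster))) +
  geom_point(size = 2.5)  # a constant size belongs outside aes()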
The clusters above help us understand the point guard position in the NBA. Point guards are the playmakers of the team. As the graph shows, most point guards fall in the lower part of the graph, where assist totals are lower. What makes a great point guard is the ability to both score and create assists. The green cluster is where the league's best point guards sit, with high numbers of both points and assists.
Here we are taking a look at power forwards.
playerspf <- filter(players, Pos == "PF")
players_scaled_pf = playerspf %>%
  select(AST, PTS) %>%  # keep only assists and points
  scale()
fviz_nbclust(players_scaled_pf, kmeans, method = "wss")
fviz_nbclust(players_scaled_pf, kmeans, method = "silhouette")
fviz_nbclust(players_scaled_pf, kmeans, method = "gap_stat")
set.seed(1)
clusterpf = kmeans(players_scaled_pf,
centers = 3,
nstart = 25)
fviz_cluster(
clusterpf,
data = players_scaled_pf,
main = "Power Forwards Segmented by Points and Assists",
repel = TRUE
)
We also decided to look into the power forward position specifically. According to NBA statistics, power forwards have won the second-most MVP awards of any position. The position allows a player to be both a scorer and a playmaker. The size and length of power forwards give them an advantage on defense, while many also have the quickness and shooting of a point guard. As the league evolves, power forwards and centers who are 6'8" and taller are increasingly able to play like 6'2" point guards.
Now let's look at the steals and blocks of power forwards.
playerspf1 <- filter(players, Pos == "PF")
players_scaled_pf4 = playerspf1 %>%
  select(STL, BLK) %>%  # keep only steals and blocks
  scale()
set.seed(1)
clusterpf4 = kmeans(players_scaled_pf4,
centers = 3,
nstart = 25)
fviz_cluster(
clusterpf4,
data = players_scaled_pf4,
main = "Power Forwards Segmented by Steals and Blocks",
repel = TRUE
)
playerspf1$cluster <- as.integer(clusterpf4$cluster)
ggplot(data = playerspf1,
       aes(x = scale(STL), y = scale(BLK), color = factor(cluster))) +
  geom_point(size = 2.5)  # a constant size belongs outside aes()
In conclusion, the large quantity of NBA player data, combined with machine learning, is valuable for understanding player performance. By looking at clusters, we can better understand player dynamics and make more informed decisions about team strategy, player evaluation, and award considerations. Visualizing the clusters helps show how different statistics relate to each other and to groups of players. Our analysis also covers defense, which is often overlooked in player evaluations. Players who excel at blocks may not excel at steals, and centers and power forwards lean toward blocking over stealing. This underscores the importance of considering positional differences when judging defensive ability.